Central to several objective approaches to Bayesian model selection is
the use of training samples (subsets of the data), so as to allow
utilization of improper objective priors. The most common prescription
for choosing training samples is to choose them to be as small as
possible, subject to yielding proper posteriors; these are called
minimal training samples.

When data can vary widely in terms of either information content or
impact on the improper priors, use of minimal training samples can be
inadequate. Important examples include certain cases of discrete data,
the presence of censored observations, and certain situations involving
linear models and explanatory variables. Such situations require more
sophisticated methods of choosing training samples. A variety of such
methods are developed in this paper, and successfully applied in
challenging situations.

1. Introduction. Training samples play a central role in a variety of
statistical methodologies, including classification and discrimination,
cross-validation, robustness and model selection, from both Bayesian and
frequentist perspectives. Two recent developments in Bayesian model
selection are the intrinsic Bayes factor of Berger and Pericchi (1996a)
and the expected posterior prior of Perez (1998) and Perez and Berger
(2002). Central to both is utilization of training samples to convert
improper objective priors into the proper distributions typically needed
for model selection. The most common prescription for choosing training
samples is to choose them to be as small as possible, subject to yielding
proper posteriors; these are called minimal training samples.

2 J. O. BERGER AND L. R. PERICCHI

While fine for many problems, minimal training samples have been found 
to be suboptimal in an ever-increasing number of important statistical sit- 
uations, in particular those in which the data can vary widely in terms of 
information content. Important examples include the presence of censored 
observations, studied in Section 3; certain cases of discrete data, studied in 
Section 4; and situations involving unbalanced linear models or covariates, 
studied in Section 5. 

A variety of strategies have been developed to overcome the limitation of 
minimal training samples, and the main purpose of this paper is to outline 
these strategies. The generalizations of training samples considered herein 
can alternatively be viewed as choosing training samples in a random fash- 
ion, or as providing a "weighting" to chosen training samples. One partic- 
ularly interesting example is a sequential random minimal training sample, 
which is a training sample of smallest size such that the posterior is proper, 
but which is obtained by drawing observations randomly, without replace- 
ment, from the set of data. Another natural use of random training samples 
is when the original data is not available, but sufficient statistics are given; 
training samples can then be generated from the conditional distribution of 
the data, given the sufficient statistics. 

We will see considerable evidence that use of the new definitions of train- 
ing samples can successfully overcome a wide variety of problems in Bayesian 
model selection. It is worth noting up front, however, that we were unable 
to define any type of "optimal" training sample; the paper can thus be 
viewed as providing a useful set of strategies that can be employed to obtain 
good training samples, with statistical judgement being required to select 
from among these strategies in particular contexts. While this prevents the 
proposed model selection methods from being completely automatic, the 
judgements involved in choosing good training samples will typically be 
much less than the judgements needed to implement an actual subjective 
Bayesian analysis. See Section 6 for overall suggestions and further context 
concerning this issue. 

In the remainder of this section, the model selection problem is stated, and 
intrinsic Bayes factors and expected posterior priors are defined. Section 1.3 
discusses the key problem that arises, which can be best understood through 
the device of studying the intrinsic priors corresponding to intrinsic Bayes 
factors; these are the priors that, if used directly to compute Bayes factors, 
would yield (in an asymptotic sense) the same answers as the intrinsic Bayes 
factors. As further discussed in Berger and Pericchi (2001), we feel this to be 
a powerful unifying approach to understanding the performance of default 
Bayes factors.

There has been a significant literature discussing training samples in these
and other Bayesian contexts. Other recent articles include Gelfand, Dey and
Chang (1992), de Vos (1993), Iwaki (1997, 1999), Lingham and Sivaganesan
(1997, 1999), Alqallaf and Gustafson (2001) and Ghosh and Samanta (2002).

1.1. Model selection notation. Suppose that we are comparing $q$ models
for the data $\mathbf{x} = (x_1, \ldots, x_n)$,

$$M_i : \mathbf{X} \text{ has density } f_i(\mathbf{x} \mid \theta_i), \qquad i = 1, \ldots, q,$$

where the $\theta_i$ are unknown model parameters. Let $\pi_i(\theta_i)$,
$i = 1, \ldots, q$, be prior distributions for the unknown parameters, and
define the marginal or predictive densities of $\mathbf{x}$,

$$m_i(\mathbf{x}) = \int f_i(\mathbf{x} \mid \theta_i)\, \pi_i(\theta_i)\, d\theta_i.$$

The Bayes factor of $M_j$ to $M_i$ is given by

$$B_{ji} = \frac{m_j(\mathbf{x})}{m_i(\mathbf{x})} = \frac{\int f_j(\mathbf{x} \mid \theta_j)\, \pi_j(\theta_j)\, d\theta_j}{\int f_i(\mathbf{x} \mid \theta_i)\, \pi_i(\theta_i)\, d\theta_i} \tag{1}$$

and is often interpreted as the "odds provided by the data for $M_j$ versus
$M_i$." Thus $B_{ji} = 10$ would suggest that the data favor $M_j$ over $M_i$
at odds of ten to one. Alternatively, $B_{ji}$ is sometimes called the
"weighted likelihood ratio of $M_j$ to $M_i$," with the priors being the
"weighting functions." These interpretations are particularly appropriate
when, as here, we focus on conventional or default choices of the priors.

1.2. Intrinsic Bayes factors and expected posterior priors. For the $q$
models $M_1, \ldots, M_q$ suppose that only noninformative priors
$\pi_i^N(\theta_i)$, $i = 1, \ldots, q$, are available. In general, we
recommend that these be chosen to be "reference priors" [see Berger and
Bernardo (1992)]. Define the corresponding marginal or predictive densities
of $\mathbf{x}$,

$$m_i^N(\mathbf{x}) = \int f_i(\mathbf{x} \mid \theta_i)\, \pi_i^N(\theta_i)\, d\theta_i.$$

Unfortunately, the direct use of improper priors for defining Bayes factors
in (1) is not generally justifiable [cf. Berger and Pericchi (1996a, 2001)], but
they can be utilized for model selection through the introduction of training
samples. Here is the standard type of training sample.

Definition [Berger and Pericchi (1996a)]. A training sample, to be
indexed by $l$, is a subset of the data, $\mathbf{x}(l)$. It is called proper
if $0 < m_i^N(\mathbf{x}(l)) < \infty$ for all $M_i$. Let $\mathcal{X}^P$
denote the set of all proper training samples and define its cardinality as
$L_P$. A training sample is minimal if it is proper and no subset is proper.
A minimal training sample will be denoted MTS; let $\mathcal{X}^M$ and $L_M$
denote, respectively, the set of all MTS and its cardinality.

Thus $\mathbf{x}(l)$ can be used to "convert" the improper
$\pi_i^N(\theta_i)$ to proper posteriors,

$$\pi_i^N(\theta_i \mid \mathbf{x}(l)) = \frac{f_i(\mathbf{x}(l) \mid \theta_i)\, \pi_i^N(\theta_i)}{m_i^N(\mathbf{x}(l))}. \tag{2}$$

These posteriors can then be used to define Bayes factors for the remaining
data.

Since there are typically many possible training samples, it is natural 
to average the resulting Bayes factors over the training samples in some 
fashion. The resulting Bayes factor for comparing Mj to Mi [called the 
intrinsic Bayes factor (IBF) in Berger and Pericchi (1996a)] is 

$$B_{ji} = B_{ji}^N \cdot \mathrm{AVE}\left[B_{ij}^N(l)\right], \tag{3}$$

where

$$B_{ji}^N = \frac{m_j^N(\mathbf{x})}{m_i^N(\mathbf{x})} \qquad \text{and} \qquad B_{ij}^N(l) = B_{ij}^N(\mathbf{x}(l)) = \frac{m_i^N(\mathbf{x}(l))}{m_j^N(\mathbf{x}(l))},$$

and "AVE" denotes an average of the $B_{ij}^N(\mathbf{x}(l))$. A variety of
possible averages have been considered [see Berger and Pericchi (1996a,
2001)], the most common being arithmetic, geometric and median averages. Some
recent references to use and development of intrinsic Bayes factors in
various scenarios include Berger and Pericchi (1996b, 1996c, 1998), Bertolino
and Racugno (1996), De Santis and Spezzaferri (1997), Lingham and Sivaganesan
(1997, 1999), Sun and Kim (1997), Berger, Pericchi and Varshavsky (1998), Key,
Pericchi and Smith (1999), Moreno, Bertolino and Racugno (1998, 1999, 2001),
Bertolino, Racugno and Moreno (2000), Berger and Mortera (1999), Sivaganesan
and Lingham (1999), Kim and Sun (2000), Rodriguez and Pericchi (2001),
Beattie, Fong and Lin (2002), Ghosh and Samanta (2002) and Paulo (2002).

Another recent use of training samples for model selection is in the
development of empirical expected posterior priors [Perez (1998), Perez and
Berger (2001, 2002) and Neal (2001)], defined as

$$\pi_i^{EP}(\theta_i) = \frac{1}{L_M} \sum_{\mathbf{x}(l) \in \mathcal{X}^M} \pi_i^N(\theta_i \mid \mathbf{x}(l)). \tag{4}$$

The idea is that, instead of using the minimal training samples to define 
proper posteriors for computation of Bayes factors and then averaging the 
ensuing Bayes factors, one can first average the proper posteriors and then 
compute Bayes factors with the results. This approach can be embedded 
within Markov chain Monte Carlo analysis, which can be a considerable 
computational advantage. Another advantage is that one can use minimal 
training samples for each separate model, which has certain computational 
and theoretical benefits. 



TRAINING SAMPLES IN MODEL SELECTION 5 

1.3. Evaluation of intrinsic Bayes factors and a key condition. The most 
basic approach to evaluation of intrinsic Bayes factors is simply to see if they 
produce sensible answers. In Berger and Pericchi (1996c, 2001) it is argued 
that the best way to study this is to determine the intrinsic prior cor- 
responding to an IBF. The intrinsic prior is that prior which would yield 
Bayes factors that are approximately equal to the IBF, in an asymptotic 
sense. If this intrinsic prior is sensible, then the IBF is judged to be sensible. 
The power and sensitivity of the use of intrinsic priors in appraising de- 
fault Bayesian model selection methods is illustrated in Berger and Mortera 
(1999) and Berger and Pericchi (2001); see also the Examples in Sections 3 
and 4 in this paper. It is particularly important to establish the existence 
(and sensibility) of intrinsic priors when new concepts are introduced (as 
here, to deal with censored data and other difficulties); such initial study 
can give considerable confidence that the new IBFs will work more generally. 

One can also use intrinsic priors directly as the conventional prior for 
model selection [cf. Sun and Kim (1997), Moreno, Bertolino and Racugno 
(1998, 1999, 2001), Bertolino, Racugno and Moreno (2000), Kim and Sun
(2000), Cano, Kessler and Moreno (2002), Moreno, Giron and Torres (2004), 
Moreno, Torres and Casella (2002), Paulo (2002), Giron, Martinez and 
Moreno (2003) and Moreno and Liseo (2003)]; this is an attractive possi- 
bility, although it is often more computationally intensive than using the 
IBF directly. Indeed, analytic determination of intrinsic priors can itself be 
quite difficult, and they will frequently not have closed form expressions. 
[They can have expressions amenable to MCMC computation, however; see 
Perez and Berger (2002).] 

Computation of intrinsic priors corresponding to model selection requires
an extension from the finite set of proper training samples for the existing
data to a hypothetical sampling space of proper training samples, to be
denoted by $\mathcal{X}^I$, based on imagining availability of an infinite
sequence of data. Choice of this sampling space is sometimes automatic, but
sometimes involves judgement; an example of each is given below. Note that
$\mathcal{X}^I$ will typically be considered fixed for all models under
consideration, although there are situations (such as with expected posterior
priors) in which $\mathcal{X}^I$ can be allowed to vary with the model.

Example 1. Suppose $X_1, X_2, \ldots$ are i.i.d. from the normal
distribution with unknown mean $\mu$ and variance $\sigma^2$. For the usual
reference prior, $\pi(\mu, \sigma^2) = 1/\sigma^2$, an easy computation shows
that an MTS must consist of any two distinct observations. Thus, if we use
the MTS notion to define training samples, it is clear that we should define
$\mathcal{X}^I$ to be the set of all pairs of (distinct) observations from the
hypothetical infinite population of normal observations having mean $\mu$ and
variance $\sigma^2$. (The word "distinct" is theoretically superfluous, since
the distribution is absolutely continuous.)

Example 2. Consider a linear model in which observation $X_i$ has an
associated $k$-vector of covariates $D_i$, $i = 1, \ldots, n$. Suppose that an
MTS would consist of any $m$ observations for which the corresponding vectors
$D_i$ are linearly independent. If we wish to extend this definition to an
infinite population, it is necessary to decide if the covariates are viewed
as fixed or themselves random. In the former case, we can simply imagine that
the hypothetical infinite population arises from proportionally replicated
covariates. Letting $D$ denote the $n \times k$ design matrix of fixed
covariates, $\mathcal{X}^I$ can then be formally defined as the space of sets
of $m$ observations that arise by first randomly drawing $m$ linearly
independent rows from $D$, and then generating corresponding observations
from the linear model. If the covariates are considered random, one would
first have to define the sampling distribution of covariates and then
construct $\mathcal{X}^I$ by draws from the covariate distribution, followed
by generation of observations from the linear model. In this paper we shall
only consider the fixed covariates scenario.

The special case of intrinsic priors that will be considered in this paper
is that in which there are two models, $M_0$ nested in $M_1$, and the
arithmetic average is used in (3). Then the intrinsic prior is given by

$$\pi_1^I(\theta_1) = \pi_1^N(\theta_1)\, E_{\theta_1}^{M_1}\!\left[ B_{01}^N(\mathbf{X}(l)) \mid \mathcal{X}^I \right], \tag{5}$$

where $E_{\theta_1}^{M_1}$ refers to expectation under model $M_1$. This
expression differs from the earlier expressions for an intrinsic prior, given
in Berger and Pericchi (1996a, 1996c), because of the conditioning on
$\mathcal{X}^I$. In the examples considered in these earlier papers,
$P_{\theta_1}^{M_1}(\mathcal{X}^I) = 1$, so that the conditioning was not
needed. In general, however, the conditioning is needed to correctly define
the intrinsic prior.

One important property of a "good" intrinsic prior is that it integrate to
one. [If it fails to do so, the corresponding IBF would appear to be "biased"
toward one of the models; see, e.g., Berger and Mortera (1999) and Berger
and Pericchi (2001).] Theorem 1 in Berger and Pericchi (1996a) asserts that
this will be so (under mild regularity conditions) if $\pi_0^N$ is proper
(trivially satisfied if $M_0$ is a simple hypothesis). Again, however, it was
implicitly assumed that $\mathcal{X}^I$ had probability one; in this paper we
formally state our assumption:

Assumption 0. $P_{\theta_i}^{M_i}(\mathcal{X}^I) = 1$, $i = 0, 1$.

In Sections 3 and 4, we will see that this assumption can be violated 
for the set of minimal training samples, in situations involving censoring or 
when inappropriate initial noninformative priors are utilized. 

If Assumption 0 is satisfied and $\pi_0^N$ is proper, then the intrinsic
prior will be proper. For simplicity we only show this in the case when $M_0$
is a simple model.

Lemma 1. If Assumption 0 holds, $M_0$ is a simple model (i.e., $\theta_0$ is
specified) and the intrinsic prior is given by (5), then

$$\int \pi_1^I(\theta_1)\, d\theta_1 = 1.$$

Proof. Since, by Assumption 0, $\mathcal{X}^I$ is the support of
$\mathbf{X}(l)$ under $M_1$, and since $\mathcal{X}^I$ contains only proper
training samples, it follows from (5) that

$$\int \pi_1^I(\theta_1)\, d\theta_1 = \int \int_{\mathcal{X}^I} \pi_1^N(\theta_1)\, \frac{m_0^N(\mathbf{x}(l))}{m_1^N(\mathbf{x}(l))}\, f_1(\mathbf{x}(l) \mid \theta_1)\, d\mathbf{x}(l)\, d\theta_1.$$

Applying Fubini's theorem to switch the order of integration yields

$$\int \pi_1^I(\theta_1)\, d\theta_1 = \int_{\mathcal{X}^I} m_0^N(\mathbf{x}(l))\, d\mathbf{x}(l) = P_{\theta_0}^{M_0}(\mathcal{X}^I) = 1,$$

the last step following from the assumption that $M_0$ is simple and
Assumption 0. $\square$

If Assumption 0 does not hold, the intrinsic prior can be highly
unsatisfactory (even improper, as we will see in later examples), casting
considerable doubt on the quality of the associated IBF. Thus, if Assumption 0
is violated in a particular context, the set of training samples should be
enlarged until the assumption is satisfied. This can sometimes be done by
changing the noninformative prior but, more generally, a more sophisticated
definition of training sample is required.

Note that under Assumption 0 the intrinsic prior in (5) has the alternative
representation

$$\pi_1^I(\theta_1) = \int \pi_1^N(\theta_1 \mid \mathbf{x}(l))\, m_0^N(\mathbf{x}(l))\, d\mathbf{x}(l), \tag{6}$$

which is also called the base-model posterior expected prior in Perez and
Berger (2002). If one is interested in utilizing the intrinsic prior directly in
computing Bayes factors, this expression is typically most useful in that,
within MCMC, one can simply drop the integral sign and treat $\mathbf{x}(l)$
as a latent variable. The improved training samples that are obtained in the
following sections for IBFs can also be immediately utilized in (6) to obtain
improved intrinsic priors that are computationally attractive.

As a final comment, when $\pi_0^N(\theta_0)$ is improper, then
$\pi_1^I(\theta_1)$ will also be improper. However, it is well calibrated
with $\pi_0^N(\theta_0)$, in the sense that a limiting argument over compact
sets shows that the Bayes factor for the two priors is a well-defined limit
of proper priors. See Berger and Pericchi (1996a) for discussion and Moreno,
Bertolino and Racugno (1998) for implementation.

2. Generalizations of training samples. To handle situations in which
Assumption 0 is violated and in which training samples can contain very
different information, it is necessary to introduce more general types of
training samples.

2.1. Randomized and weighted training samples. 

Definition 1. A randomized training sample with sampling mechanism
$\mathbf{p} = (p_1, \ldots, p_{L_P})$, where $\mathbf{p}$ is a probability
vector, is obtained by drawing a training sample from $\mathcal{X}^P$
according to $\mathbf{p}$. Alternatively, the training samples can be
considered to be weighted training samples with weights $p_i$.

Example 3 (Sequential random sampling). We will be particularly interested
in sequential minimal training samples (SMTS) that are each obtained by
drawing observations from the collection of data
$\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ by simple random sampling (without
replacement for a given SMTS), stopping when the subset so formed,
$\mathbf{x}^*(l) = (x(l)_1, \ldots, x(l)_{N(l)})$, is a proper training
sample. Note that $N(l)$ is itself a random variable. Although intuitively
and operationally one obtains an SMTS by sequential random sampling, such
training samples can also be described via Definition 1, with $p_i$ being the
probability of obtaining the $i$th SMTS via sampling without replacement from
the set of observations, and all other proper training samples being assigned
probability 0.
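
The sequential drawing scheme of Example 3 can be sketched in a few lines. The following is only an illustration under stated assumptions: the function name `draw_smts` and the user-supplied predicate `is_proper` are hypothetical, and the propriety rule used in the small illustration (a subset is proper once it contains an uncensored value) is the one that applies to the censored exponential setting of Section 3, not a general rule.

```python
import random


def draw_smts(data, is_proper, rng):
    """Draw one sequential minimal training sample (SMTS) from `data`.

    Observations are taken one at a time in a uniformly random order
    (simple random sampling without replacement); sampling stops as soon
    as the accumulated subset is a proper training sample.
    """
    order = list(range(len(data)))
    rng.shuffle(order)
    sample = []
    for idx in order:
        sample.append(data[idx])
        if is_proper(sample):
            return sample  # len(sample) plays the role of N(l)
    return None  # even the full data set is not a proper training sample


# Hypothetical illustration: right-censored data with censoring time r,
# where a subset is taken to be proper once it holds an uncensored value.
r = 2.0
data = [2.0, 0.7, 2.0, 2.0, 1.4, 2.0]
smts = draw_smts(data, lambda s: any(x < r for x in s), random.Random(0))
```

Repeating the call with fresh random numbers yields the collection of SMTS over which Bayes factors or posteriors are then averaged.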

Remark. When the Xi are i.i.d. and arise from an absolutely continuous 
distribution, then an SMTS will typically equal an MTS with probability 
one, since each distinct observation will typically have the same effect on 
posterior propriety. 

Example 4 (Sampling of minimal training samples). Often the number
of minimal training samples $L_M$ is extremely large, so that the computation
of the averages in (3) can be very expensive. In such situations it usually
suffices to just randomly choose minimal training samples [i.e., set
$p_i = 1/L_M$ for $\mathbf{x}(l) \in \mathcal{X}^M$ and set $p_i = 0$
otherwise in Definition 1]. Indeed, in Varshavsky (1995) the theory of
$U$-statistics is used to indicate that it often suffices to randomly choose
$L = kn$ minimal training samples, where $n$ is the sample size of the actual
data and $k$ is the size of the minimal training sample (assuming there is a
fixed size). This is clearly much smaller than the number of minimal training
samples, $\binom{n}{k}$. (Unfortunately, precise guidelines as to the choice
of $L$ are not available, so a reasonable practical implementation is to
start with the choice $kn$ and increase $L$ until the change in the resulting
Bayes factor is sufficiently small.)
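
The practical rule in Example 4 (start at $L = kn$ and enlarge $L$ until the average stabilizes) might be implemented along the following lines. This is a sketch only: it assumes every size-$k$ subset of the data is an MTS (as for i.i.d. continuous data), the doubling schedule and the relative tolerance `tol` are illustrative choices not taken from the paper, and `term` is a hypothetical user-supplied function returning the quantity to be averaged, for example $B_{ij}^N(\mathbf{x}(l))$, for a given set of indices.

```python
import math
import random


def average_until_stable(term, n, k, rng, tol=0.01, max_rounds=10):
    """Average `term` over randomly chosen size-k subsets of {0,...,n-1}.

    Starts with L = k*n draws and doubles L until the relative change in
    the running average is at most `tol` (or `max_rounds` is reached).
    Returns (average, number of draws used).
    """
    values = [term(rng.sample(range(n), k)) for _ in range(k * n)]
    avg = sum(values) / len(values)
    for _ in range(max_rounds):
        extra = [term(rng.sample(range(n), k)) for _ in range(len(values))]
        values.extend(extra)
        new_avg = sum(values) / len(values)
        if abs(new_avg - avg) <= tol * abs(avg):
            return new_avg, len(values)
        avg = new_avg
    return avg, len(values)


n, k = 50, 2
# Starting number of draws versus the total number of size-k subsets.
print(k * n, math.comb(n, k))
```

Draws are made with repeats allowed across training samples, which matches the Monte Carlo approximation used later for the arithmetic IBF.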
Example 5 (Probability proportional to information). Observations are
often associated with covariates. In linear models for instance, a training
sample $\mathbf{x}(l)$ will typically have a corresponding "design matrix" of
covariates $D(l)$ and corresponding "information" proportional to
$|D(l)'D(l)|$. One could choose training samples with probability
proportional to this information (or perhaps the square root of the
information). This was proposed in de Vos (1993).
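
As a rough illustration of Example 5, the sketch below enumerates all size-$m$ subsets of the rows of a small design matrix, computes $|D(l)'D(l)|$ for each, and normalizes to obtain the weights $p_l$ of Definition 1. The helper names (`det`, `information`, `information_weights`) and the toy design are hypothetical, and full enumeration is only sensible for small problems. Subsets with linearly dependent rows receive weight zero.

```python
import itertools
import random


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size, d = len(a), 1.0
    for c in range(size):
        p = max(range(c, size), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, size):
            f = a[r][c] / a[c][c]
            for j in range(c, size):
                a[r][j] -= f * a[c][j]
    return d


def information(rows):
    """|D(l)'D(l)| for the design rows D(l) of one training sample."""
    k = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    return det(gram)


def information_weights(design, m):
    """All size-m index subsets and probabilities proportional to information."""
    subsets = list(itertools.combinations(range(len(design)), m))
    info = [information([design[i] for i in s]) for s in subsets]
    total = sum(info)
    if total <= 0.0:
        raise ValueError("no subset has positive information")
    return subsets, [v / total for v in info]


# Toy design: intercept plus one covariate taking values 0, 1, 3.
design = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0]]
subsets, probs = information_weights(design, 2)
draws = random.Random(0).choices(subsets, weights=probs, k=5)
```

Replacing each information value by its square root before normalizing gives the alternative weighting mentioned in the example.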

On the other hand, one does not want training samples to be too informa- 
tive. Suppose, for instance, that almost all of the information in the entire 
sample is due to a single observation. Utilization of that observation as a 
training sample can be inappropriate, as will be seen in Section 5. Indeed, it 
is generally a good idea to restrict attention to training samples that contain 
only a modest fraction of the total information in the data, although this 
may not always be possible [cf. Rodriguez and Pericchi (2001)]. 

Example 6 (Random sampling to reach a given information level). An 
interesting variant of the sequential random sampling approach to construc- 
tion of a training sample is to stop, not when the training sample is proper, 
but when the training sample contains a certain amount of "information." 
We do not pursue this idea here. 

2.2. Imaginary training samples. A different notion that has been
employed [in, e.g., Good (1950), Smith and Spiegelhalter (1980), Iwaki (1997,
1999), Perez (1998), Rodriguez and Pericchi (2001), Ghosh and Samanta
(2002) and Perez and Berger (2002)] is that of an imaginary training
sample: training samples are generated, not from the real data, but from some
specified distribution. For instance, in model selection one might elicit a
subjective predictive distribution, $m^*(\mathbf{x}^*)$, where $\mathbf{x}^*$
is thought of as a "future" minimal training sample. One could then draw
training samples from this distribution for Bayesian model selection, or use
the associated expected posterior priors [see Perez and Berger (2002) for
motivation and further discussion].

One potential difficulty with training sample methods is that often only 
sufficient statistics (and not the actual data) are available. Use of imaginary 
training samples can overcome this difficulty. 

Definition 2. A conditional imaginary training sample, for a situation 
in which only sufficient statistics from a model are available, is defined to 
be a training sample from the conditional distribution of the data given the 
sufficient statistics. 

If $S$ is a sufficient statistic, the factorization theorem gives

$$f(\mathbf{x} \mid \theta) = g(S \mid \theta) \cdot h(\mathbf{x} \mid S),$$

and we can repeatedly draw conditional imaginary training samples
$\mathbf{x}^*$ from the corresponding marginal distribution
$h(\mathbf{x}^* \mid S)$. In computation of intrinsic Bayes factors or
expected posterior priors one then presumes that the imaginary training
sample $\mathbf{x}^*$ arose from the density $f(\mathbf{x}^* \mid \theta)$.

Example 7 (Example 1 continued). Let $X_1, \ldots, X_n$ be an i.i.d. sample
from the normal distribution with mean $\mu$ and variance $\sigma^2$, but
suppose that only the sufficient statistics $\bar{x}$ and
$s^2 = \sum_i (x_i - \bar{x})^2$ are reported, along with $n$. A very simple
way to draw conditional imaginary training samples is to create the surrogate
data set

$$x_i^* = \frac{s}{s_Z}\,(Z_i - \bar{Z}) + \bar{x}, \qquad i = 1, \ldots, n,$$

where the $Z_i$ are independent standard normal with sample mean and sum
of squared deviations $\bar{Z}$ and $s_Z^2$, respectively. This surrogate
data set clearly has the same sample mean and sum of squared deviations as
the original data and is a draw from $h(\mathbf{x} \mid \bar{x}, s^2)$. One
can then choose training samples (recall minimal training samples were of
size 2) from this surrogate data set. (Note that it is necessary to have
$n \ge 3$ in order to have training samples that are not simply the entire
data set.) One can also draw additional surrogate data sets if more training
samples are needed (an advantage of using imaginary training samples).
Imaginary training samples are used as if they were real training samples,
that is, they are assumed to arise from the original normal distribution with
$\mu$ and $\sigma^2$.
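
The surrogate data construction of Example 7 is short enough to write out directly. The sketch below assumes, as in the example, that `s2` is the reported sum of squared deviations (not a variance estimate); the function name `surrogate_data` and the numerical inputs are hypothetical.

```python
import math
import random


def surrogate_data(n, xbar, s2, rng):
    """Surrogate data with sample mean `xbar` and sum of squared deviations `s2`.

    Standard normal draws Z_i are centered, rescaled by s / s_Z and shifted
    by xbar, so both reported sufficient statistics are matched.
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    zbar = sum(z) / n
    s2_z = sum((zi - zbar) ** 2 for zi in z)
    scale = math.sqrt(s2 / s2_z)
    return [scale * (zi - zbar) + xbar for zi in z]


rng = random.Random(0)
x_star = surrogate_data(10, 3.0, 12.5, rng)
# An imaginary minimal training sample: two observations from the surrogate set.
training_sample = rng.sample(x_star, 2)
```

Calling `surrogate_data` again produces a fresh surrogate data set when more training samples are wanted.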

Example 8 (Poisson distribution). Suppose that $X$ is a single realization
from a Poisson distribution with mean $\theta T$, arising as the number of
rare events observed in a time period $T$. We consider testing of
$H_0 : \theta = \theta_0$ versus $H_1 : \theta \neq \theta_0$, utilizing the
improper Jeffreys prior, $\pi^N(\theta) = \theta^{-1/2}$.

A natural way to define imaginary training samples is to use the fact that
such a Poisson $X$ can be viewed as arising from a sum of the indicators
of events occurring with exponential inter-arrival times. More precisely, for
$i = 1, 2, \ldots$, consider $X_i \sim f(x_i \mid \theta) = \theta \exp(-\theta x_i)$,
and define

$$X = \Big\{ \text{first } j \text{ such that } S_j = \sum_{i=1}^{j} X_i > T \Big\} - 1.$$

Then $X$ has the Poisson distribution with mean $\theta T$.

It is natural to utilize these latent $\{x_1, \ldots, x_X\}$ to construct
imaginary training samples. No simple trick is available as in the previous
example, so we must determine $h(x_1, \ldots, x_X \mid X)$. Computation yields
that this is the uniform density on $\sum_{i=1}^{X} x_i < T$. Thus, if
training samples consist of a single observation (as is the case in the
testing situation we consider with the Jeffreys prior), an imaginary training
sample can be drawn from the marginal distribution of a single $x_i$ arising
from this uniform distribution, which is

$$h(x_i \mid X) = \frac{X}{T} \left( 1 - \frac{x_i}{T} \right)^{X-1}, \qquad 0 < x_i < T. \tag{7}$$

Single imaginary training samples can thus be drawn as
$X^* = T[1 - U^{1/X}]$, where $U$ is Uniform(0, 1). These are then used in
constructing intrinsic Bayes factors and/or expected posterior priors, as if
they had arisen from the exponential density with mean $1/\theta$. Note that
we have implicitly assumed that $T > 0$ in defining the imaginary training
samples.
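
A minimal sketch of the inverse-CDF draw in Example 8 follows. The function name is hypothetical, and the guard requiring at least one observed event is an assumption of this sketch (with no events there is no latent inter-arrival time to draw), not a statement from the paper.

```python
import random


def draw_imaginary_training_sample(X, T, rng):
    """Draw X* = T * (1 - U**(1/X)) with U ~ Uniform(0, 1), as in (7)."""
    if X < 1:
        # Assumption of this sketch: at least one event was observed,
        # so that a latent inter-arrival time exists to be drawn.
        raise ValueError("need X >= 1 observed events")
    u = rng.random()
    return T * (1.0 - u ** (1.0 / X))


rng = random.Random(0)
draws = [draw_imaginary_training_sample(4, 10.0, rng) for _ in range(20000)]
```

Under density (7) the draws have mean $T/(X+1)$, which gives a quick sanity check on the sampler.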

The situation is not always as nice as the above examples would suggest,
in that the information needed to construct $h(\mathbf{x} \mid S)$ in order to
generate the imaginary training samples can be lost when a sufficiency
reduction is effected.

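Example 9 (Linear model). Suppose $\mathbf{Y}$ $(n \times 1)$ arises from
the linear model

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}),$$

where $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_k)'$ is unknown,
$\sigma^2$ is known, and $\mathbf{X}$ is an $(n \times k)$ given design matrix
of rank $k \le n$. The least squares estimate
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
is then sufficient for $\boldsymbol{\beta}$, and one might be presented only
with $n$, $\hat{\boldsymbol{\beta}}$ and its covariance matrix
$\boldsymbol{\Sigma} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$ after a
sufficiency reduction. From this one cannot reconstruct the conditional
distribution of the data given $\hat{\boldsymbol{\beta}}$, because the design
matrix, and hence the covariates, have been "lost" (unless $n = k$, in which
case it can be reconstructed from $\boldsymbol{\Sigma}$). So imaginary
training samples cannot be generated in this way. For some ideas as to
alternative ways of generating imaginary training samples in situations such
as this, see Iwaki (1999).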

2.3. Utilization of generalized training samples. For training samples
defined as in Definition 1 and considered as weighted training samples, the
arithmetic IBF and empirical expected posterior priors are defined,
respectively, as

$$B_{ji}^{A} = B_{ji}^N \sum_{l=1}^{L_P} p_l\, B_{ij}^N(\mathbf{x}(l)), \tag{8}$$

$$\pi_i^{EP}(\theta_i) = \sum_{l=1}^{L_P} p_l\, \pi_i^N(\theta_i \mid \mathbf{x}(l)). \tag{9}$$

It is often not feasible to compute these weighted averages (because of the
large number of possible training samples), in which case it is easier to
draw $L$ random training samples,
$\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(L)$, according to the
random schemes discussed above for generating the training samples (repeats
allowed), and then just approximate the arithmetic IBF and empirical expected
posterior priors by, respectively,

$$B_{ji}^{A} \cong B_{ji}^N\, \frac{1}{L} \sum_{l=1}^{L} B_{ij}^N(\mathbf{x}(l)), \tag{10}$$

$$\pi_i^{EP}(\theta_i) \cong \frac{1}{L} \sum_{l=1}^{L} \pi_i^N(\theta_i \mid \mathbf{x}(l)). \tag{11}$$

It usually suffices to take $L$ to be a modest multiple of the overall sample
size $n$.

3. Censoring. Censored data provides a key illustration of these ideas. 
We begin with an example involving right-censoring. For another discussion 
of training samples in the presence of censoring, see Lingham and Sivagane- 
san (1999). 

Example 10 (Right censoring of exponential data). Suppose the data
$x_1, \ldots, x_n$ arises as a random sample from the right-censored
Exponential($\theta$) density; thus, if $x_i < r$ it arises from the density
$f(x_i \mid \theta) = \theta \exp(-\theta x_i)$, while
$P(X_i = r \mid \theta) = p(\theta) = \exp(-r\theta)$. It is desired to test

$$M_0 : \theta = \theta_0 \quad \text{versus} \quad M_1 : \theta \neq \theta_0.$$

Consider the usual default prior $\pi^N(\theta) = \theta^{-1}$. It is easy to
show that any single uncensored observation yields a proper posterior, while
no number of censored observations will do so. Hence the set of minimal
training samples $\mathcal{X}^M$ consists of the collection of single
uncensored observations. Since censored observations never enter into the
training samples, the MTS's will intuitively be biased in favor of larger
values of $\theta = 1/E[X_i \mid \theta]$, which seems undesirable.

To evaluate the situation more carefully, consider the intrinsic prior for
$\theta$ corresponding to the arithmetic IBF; this prior is given by (5),
where the sampling space of training samples, here denoted
$\mathcal{X}^{IM}$, is simply the interval $(0, r)$ [i.e., the space of single
uncensored observations drawn from $f(x \mid \theta, x < r)$]. Note first that
Assumption 0 is violated, since

$$P_{\theta_i}^{M_i}(\mathcal{X}^{IM}) = P_{\theta_i}^{M_i}(X < r) = 1 - \exp(-r\theta_i) < 1, \qquad i = 0, 1,$$




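so that we expect problems with the intrinsic prior (and hence with the
intrinsic Bayes factor). Noting that the Bayes factor for a training sample
is $B_{01}^N(x) = \theta_0 \exp(-\theta_0 x) / \int_0^\infty \theta^{-1}\, \theta \exp(-\theta x)\, d\theta = x\theta_0 \exp(-\theta_0 x)$,
the intrinsic prior in (5) is given by

$$\pi^I(\theta) = \frac{1}{\theta} \int_0^r x\theta_0 \exp(-\theta_0 x)\, \frac{\theta \exp(-\theta x)}{(1 - \exp(-r\theta))}\, dx$$

$$\phantom{\pi^I(\theta)} = \frac{\theta_0}{(1 - \exp(-r\theta))} \left[ \frac{1}{(\theta + \theta_0)^2} - e^{-(\theta + \theta_0) r} \left( \frac{r}{\theta + \theta_0} + \frac{1}{(\theta + \theta_0)^2} \right) \right].$$

This is not a proper prior; indeed, as $\theta \to 0$ the prior behaves like a
constant times $1/\theta$, which is nonintegrable, a particularly egregious
failing.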
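
The $1/\theta$ behavior near zero can be checked numerically. The sketch below evaluates the closed form obtained by integrating $x\theta_0 e^{-(\theta+\theta_0)x}$ over $(0, r)$ and dividing by $1 - e^{-r\theta}$, and compares $\theta\,\pi^I(\theta)$ at a small $\theta$ with its limiting constant $(\theta_0/r)\,[\theta_0^{-2} - e^{-\theta_0 r}(r/\theta_0 + \theta_0^{-2})]$. The function name and the values $\theta_0 = r = 1$ are illustrative assumptions.

```python
import math


def intrinsic_prior_mts(theta, theta0, r):
    """Intrinsic prior for the censored exponential test when MTS are used.

    Closed form of theta0 * int_0^r x exp(-(theta + theta0) x) dx
    divided by 1 - exp(-r * theta).
    """
    a = theta + theta0
    bracket = 1.0 / a**2 - math.exp(-a * r) * (r / a + 1.0 / a**2)
    return theta0 * bracket / (-math.expm1(-r * theta))


theta0, r = 1.0, 1.0
# Limiting constant of theta * prior(theta) as theta -> 0.
limit_const = (theta0 / r) * (
    1.0 / theta0**2 - math.exp(-theta0 * r) * (r / theta0 + 1.0 / theta0**2)
)
near_zero = 1e-6 * intrinsic_prior_mts(1e-6, theta0, r)
```

A strictly positive limiting constant means the prior mass near zero diverges logarithmically, which is the impropriety noted in the text.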

One possible solution to this problem would be to use a noninformative
prior that enlarges the set of MTS's. Indeed, for this problem involving right
censoring the Jeffreys-rule prior is
$\pi^J(\theta) = \theta^{-1}[1 - \exp(-r\theta)]^{1/2}$ [De Santis, Mortera
and Nardi (2001)]. For this prior, it can be shown that any single
observation, censored or uncensored, is an MTS, so that Assumption 0 is
trivially satisfied and the resulting intrinsic prior must integrate to one.
Note, however, that extra work is involved in finding the Jeffreys-rule prior,
and this can be formidable in more complex situations (e.g., in Example
11). Furthermore, the intrinsic prior that results from use of the
Jeffreys-rule prior here has the quite unappealing property (see the Appendix)
that its median is $O(r^{-1})$ as $r \to 0$. This unattractive behavior arises
because the highly informative training samples (the uncensored observations)
have effects averaged with the (many more) censored observations that have
negligible information content as $r \to 0$. Hence we turn to use of
sequential minimal training samples to solve the problem.

For the prior $\pi^N(\theta) = \theta^{-1}$ a SMTS is of the form
$\mathbf{x}(l) = (r, \ldots, r, x(l))$, where $x(l)$ is the first uncensored
observation that arises in simple random sampling (without replacement) from
the data. (In contrast, none of the $r$ would be present in an MTS.) The
natural sampling space for such training samples is the set
$\mathcal{X}^{IS}$ of possible sequences
$\mathbf{x}(l) = (r, \ldots, r, x(l))$ of i.i.d. observations arising from the
censored exponential distribution. Let $N^*(l) = N(l) - 1$ denote the number
of censored observations in the SMTS from $\mathcal{X}^{IS}$, and write
$p(\theta) = P(X > r \mid \theta) = \exp(-\theta r)$. Note that
$P(N^*(l) = j \mid \theta) = (1 - p(\theta))\, p(\theta)^j$, and that the
joint density of $\mathbf{x}(l)$ is
$f(\mathbf{x}(l) \mid \theta) = p(\theta)^{N^*(l)}\, \theta \exp(-\theta x(l))$.

Letting $n_u$ denote the number of uncensored observations in the actual
data, and letting $T$ denote the sum of all observations (censored and
uncensored), computation yields

$$B_{10}^N = \frac{m_1^N(\mathbf{x})}{m_0^N(\mathbf{x})} = \Gamma(n_u)\,(T\theta_0)^{-n_u}\, e^{T\theta_0},$$

$$B_{01}^N(\mathbf{x}(l)) = \frac{m_0^N(\mathbf{x}(l))}{m_1^N(\mathbf{x}(l))} = \theta_0\,(N^*(l)\, r + x(l)) \exp(-[N^*(l)\, r + x(l)]\,\theta_0).$$

The approximate arithmetic IBF in (10), corresponding to L random SMTS 
draws, is then 

1 ^ 

Bto = rK)(m)-""e^^° - ^ 9o{N*{l)r + x{l)) exp(-[iV*(Or + x(/)]0o). 

^1=1 

To investigate the behavior of this IBF, we again study its corresponding
intrinsic prior. From (5) and noting that $P_\theta^{M_1}(\mathcal{X}) = 1$,
this is given by

\[
\begin{aligned}
\pi^I(\theta) &= \pi^N(\theta)\, E_\theta^{M_1}\!\left[ B_{01}^N(x^*(l)) \right] \\
 &= \frac{1}{\theta} \sum_{j=0}^{\infty} \int_0^r \theta_0 (jr + x)
    \exp(-[jr + x]\theta_0)\, p(\theta)^j\, \theta \exp(-\theta x)\, dx \\
 &= \frac{\theta_0}{(\theta + \theta_0)^2},
\end{aligned}
\tag{12}
\]

the last step following from standard calculations involving geometric
series. This is a very sensible intrinsic prior for the problem, being proper
and having median equal to $\theta_0$. Indeed, this is the intrinsic prior for
the exponential testing problem when no censoring is present and ordinary MTS
are used [Pericchi, Fiteni and Presa (1993)], an appealing result. The
indication is that use of SMTS leads to a very satisfactory arithmetic IBF in
the presence of censoring.
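
One way to see why the sum over censored counts collapses to $\theta_0/(\theta+\theta_0)^2$ is sketched below. It is a worked check under the definitions above, not the authors' own derivation: the substitution $t = jr + x$ (the total time accumulated by the SMTS) is introduced here for illustration, and it uses $p(\theta)^j = \exp(-jr\theta)$.

```latex
\begin{aligned}
\pi^I(\theta)
 &= \frac{1}{\theta}\sum_{j=0}^{\infty}\int_0^r
    \theta_0 (jr+x)\, e^{-(jr+x)\theta_0}\; e^{-jr\theta}\;\theta e^{-\theta x}\,dx
    && \text{using } p(\theta)^j = e^{-jr\theta} \\
 &= \theta_0 \sum_{j=0}^{\infty}\int_{jr}^{(j+1)r} t\, e^{-(\theta+\theta_0)t}\,dt
    && \text{substituting } t = jr + x \\
 &= \theta_0 \int_0^{\infty} t\, e^{-(\theta+\theta_0)t}\,dt
  = \frac{\theta_0}{(\theta+\theta_0)^2}, \\[4pt]
\int_0^{m} \frac{\theta_0}{(\theta+\theta_0)^2}\,d\theta
 &= 1 - \frac{\theta_0}{m+\theta_0}
    && \Rightarrow\ \text{total mass } 1 \ (m\to\infty),\ \text{median } m=\theta_0 .
\end{aligned}
```

The intervals $[jr, (j+1)r)$ tile the half-line, which is why the censoring point $r$ drops out of the final expression.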

It would be fascinating if the result observed in Example 10 (that the
intrinsic prior in the presence of censoring and using SMTS equals the
intrinsic prior when there is no censoring and using MTS) held in general.
Unfortunately, this is not the case, as can be seen by considering the density
$f(x \mid \theta) = (0.5)\exp(-|x - \theta|)$, together with a constant default
prior on $\theta$. Detailed calculations yield that the intrinsic prior without
censoring and using MTS is not equal to the intrinsic prior with right
censoring and SMTS. We omit the details.

That the intrinsic prior in (12) is proper would not have needed exact cal- 
culation. Indeed, consider the general case of censoring of i.i.d. observations, 
with a known censoring mechanism and the use of SMTS. Then the natu- 
ral sampling space is the set X of possible sequences of i.i.d observations 
arising from the original distribution (with censoring), with the sampling 
stopping the first time the training sample is proper. Assuming that the 
sampling is guaranteed to stop with probability one for any of the models 
and parameter values under consideration (i.e., that the sampling mechanism
is a proper stopping rule), then Assumption 0 is satisfied and Lemma
1 shows that the intrinsic prior is proper.

When the censoring mechanism is at least partly unknown, intrinsic pri- 
ors cannot be defined. However, SMTS can be defined, and the correspond- 
ing IBFs or empirical expected posterior priors utilized to compute Bayes 
factors. We illustrate this with an example comparing two exponential dis- 
tributions. 

Example 11 (Comparison of two exponential populations). The follow- 
ing data, which appeared in Gehan (1965), were analyzed in Cox and Oakes 
(1984) as arising from (possibly censored) exponential distributions. The 
data show times of remission (as measured by freedom from symptoms), in
weeks, of leukemia patients, where the first group consists of control
individuals and the second group consists of individuals treated with the drug
6-mercaptopurine. The data are as follows, where + indicates that the
observation has been censored.

Control: 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23. 
Treated: 6+, 6, 6, 6, 7, 9+, 10+, 10, 11+, 13, 16, 17+, 19+, 20+, 22, 23, 
25+, 32+, 32+, 34+, 35+. 

Notice that the control group has no censored observations, but more 
than half of the observations from the treated set have been censored. 

Following Cox and Oakes (1984) and with $j = 1, 2$ referring to the control
and treatment groups, respectively, assume that the uncensored failure
times $t_{ji}$ follow the Exponential($\theta_j$) distribution. Write each
observation as $x_{ji} = (y_{ji}, v_{ji})$, where $y_{ji} = \min(t_{ji}, c_{ji})$,
with $c_{ji}$ denoting the censoring time (known for the actual data), and
$v_{ji} = 0$ if $t_{ji} < c_{ji}$ (uncensored) and $v_{ji} = 1$ otherwise.
Specifying the density here is problematical when the overall distribution of
the $c_{ji}$ is not known, but for Bayesian analysis we only need the
likelihood function of $(\theta_1, \theta_2)$ for the given data, and this is given by

\[
L(\theta_1, \theta_2) = \prod_{j=1}^{2} \left[ \prod_{i=1}^{n_{ju}} \theta_j e^{-\theta_j t_{ji}}
   \prod_{i=1}^{n_{jc}} e^{-\theta_j c_{ji}} \right],
\tag{13}
\]

where $n_{ju}$ and $n_{jc}$ denote, respectively, the number of uncensored and
censored observations in each group, and the labels are rearranged if necessary.
We want to test the hypotheses

\[ M_0 : \theta_1 = \theta_2 = \theta \quad \text{versus} \quad M_1 : \theta_1 \neq \theta_2. \tag{14} \]

In the analysis, we will utilize the usual noninformative priors
$\pi_0^N(\theta) = \theta^{-1}$ and
$\pi_1^N(\theta_1, \theta_2) = \theta_1^{-1}\theta_2^{-1}$. As in Example 10, it
then follows that an SMTS must consist of a sequence of censored observations
from each group, followed by an uncensored observation. (Since in the actual
data the control group contains only uncensored observations, an SMTS for this
data will contain just a single control observation, but we will write down
expressions for the general case.) Write this SMTS as

\[
x^*(l) = \begin{pmatrix}
 c_{11}(l), \ldots, c_{1N_1^*}(l),\ t_{11}(l) \\
 c_{21}(l), \ldots, c_{2N_2^*}(l),\ t_{21}(l)
\end{pmatrix},
\]

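where $N_1^*$ and $N_2^*$ are the (random) stopping times in obtaining the SMTS.
Straightforward calculation then shows the arithmetic IBF to be

\[
B_{10}^{AI} = \frac{\Gamma(n_{1u})\,\Gamma(n_{2u})}{\Gamma(n_{1u} + n_{2u})}\,
 \frac{(T_1 + T_2)^{n_{1u} + n_{2u}}}{T_1^{n_{1u}}\, T_2^{n_{2u}}}\,
 \frac{1}{L} \sum_{l=1}^{L} \frac{T_1(l)\, T_2(l)}{(T_1(l) + T_2(l))^2},
\tag{15}
\]

where

\[ T_1 = \sum_{i=1}^{n_{1u}} t_{1i} + \sum_{i=1}^{n_{1c}} c_{1i}, \qquad
   T_2 = \sum_{i=1}^{n_{2u}} t_{2i} + \sum_{i=1}^{n_{2c}} c_{2i}, \]

\[ T_1(l) = \sum_{i=1}^{N_1^*(l)} c_{1i}(l) + t_{11}(l), \qquad
   T_2(l) = \sum_{i=1}^{N_2^*(l)} c_{2i}(l) + t_{21}(l), \]

and $L$ is the number of SMTS that are to be drawn.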

For analysis of the actual data above we computed (15) using L = n = 42, 
L = 2n and L = 5n training samples obtained by simple random sampling 
(without replacement) from the data. The resulting Bayes factors were
$B_{10}^{AI} =$ 544, 493 and 584, respectively, showing decisive evidence
against the null model and only modest variation with respect to the number of
training samples drawn. If equal prior probabilities are assumed for the
hypotheses, then the posterior probability of $M_1$ is about
$P(M_1 \mid x) = 0.998$.
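
The computation just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes that an SMTS is formed by permuting each group separately and reading observations until the first uncensored one. The full-data factor is $\Gamma(n_{1u})\Gamma(n_{2u})/\Gamma(n_{1u}+n_{2u}) \cdot (T_1+T_2)^{n_{1u}+n_{2u}}/(T_1^{n_{1u}}T_2^{n_{2u}})$ and each training sample contributes $T_1(l)T_2(l)/(T_1(l)+T_2(l))^2$. Because the draws are random, the printed values will vary with the seed and need not reproduce the figures quoted above.

```python
import math
import random

# Gehan (1965) remission times in weeks; True marks a censored observation.
CONTROL = [(t, False) for t in
           [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]]
TREATED = [(6, True), (6, False), (6, False), (6, False), (7, False), (9, True),
           (10, True), (10, False), (11, True), (13, False), (16, False),
           (17, True), (19, True), (20, True), (22, False), (23, False),
           (25, True), (32, True), (32, True), (34, True), (35, True)]


def group_summary(group):
    """Return (number of uncensored observations, sum of all observed times)."""
    n_unc = sum(1 for _, cens in group if not cens)
    total = sum(t for t, _ in group)
    return n_unc, total


def draw_smts_total(group, rng):
    """Sample the group without replacement until the first uncensored time;
    return the sum of the drawn times (censored ones plus the uncensored one)."""
    order = rng.sample(group, len(group))  # a random permutation of the group
    total = 0.0
    for t, cens in order:
        total += t
        if not cens:
            return total
    raise ValueError("group has no uncensored observation")


def log_formal_bf(group1, group2):
    """Log of the full-data factor of the arithmetic IBF."""
    n1, t1 = group_summary(group1)
    n2, t2 = group_summary(group2)
    return (math.lgamma(n1) + math.lgamma(n2) - math.lgamma(n1 + n2)
            + (n1 + n2) * math.log(t1 + t2)
            - n1 * math.log(t1) - n2 * math.log(t2))


def arithmetic_ibf(group1, group2, n_draws, rng):
    """Full-data factor times the average of T1(l)T2(l)/(T1(l)+T2(l))^2
    over n_draws randomly drawn SMTS."""
    corrections = []
    for _ in range(n_draws):
        a = draw_smts_total(group1, rng)
        b = draw_smts_total(group2, rng)
        corrections.append(a * b / (a + b) ** 2)
    return math.exp(log_formal_bf(group1, group2)) * sum(corrections) / n_draws


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, chosen only for repeatability
    for mult in (1, 2, 5):
        print(mult * 42, arithmetic_ibf(CONTROL, TREATED, mult * 42, rng))
```

Working on the log scale with `math.lgamma` avoids overflow in the Gamma functions and large powers; the training-sample correction is at most 1/4, so it can only shrink the full-data factor.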

It is also straightforward to calculate the approximations to the empirical
expected posterior priors, given in (11), and use them to compute the Bayes
factor of $M_1$ to $M_0$. The result is

\[
B_{10}^{EP} = \frac{\Gamma(n_{1u} + 1)\,\Gamma(n_{2u} + 1)}{\Gamma(n_{1u} + n_{2u} + 2)}
 \times \frac{\sum_{l=1}^{L} T_1(l)\, T_2(l)\, (T_1 + T_1(l))^{-(n_{1u}+1)}\, (T_2 + T_2(l))^{-(n_{2u}+1)}}
             {\sum_{l=1}^{L} (T_1(l) + T_2(l))^2\, (T_1 + T_2 + T_1(l) + T_2(l))^{-(n_{1u}+n_{2u}+2)}}.
\]

For the data above and random training samples of sizes $L = n = 42$,
$L = 2n$ and $L = 5n$, the resulting Bayes factors were $B_{10}^{EP} =$ 742,
713 and 728, respectively. These are similar to the arithmetic IBF, but are
systematically somewhat larger, providing support for the suggestion in Perez
and Berger (2002) that the empirical expected posterior priors will yield Bayes
factors that are somewhat more favorable to the more complex model than IBFs or
intrinsic priors.
Perhaps the most interesting feature of the above example is that Bayes
factors and posterior probabilities could be computed rather easily, without
needing to know the nature of the censoring mechanism. In contrast, classical 
answers typically depend on the (often unknown) censoring mechanism. This 
is thus an important situation in which the objective Bayesian approach 
requires significantly less knowledge than a frequentist approach. 

Lack of knowledge of the censoring mechanism does preclude computa- 
tion of the intrinsic prior corresponding to the arithmetic IBF in censoring 
situations, however; without such knowledge, it is not clear how to define 
the sampling space for the SMTS, needed for computation of the intrinsic 
prior. Of course, one might reasonably "cheat" in this situation, using the 
suggestion from Example 10 that the intrinsic prior for SMTS and in the 
presence of censoring might well be close to the intrinsic prior for the prob- 
lem when there is no censoring (and MTS are used). One could then directly 
use these "approximate" intrinsic priors to compute the Bayes factor. 

Example 12 (Example 11 continued). An MTS in the uncensored version of this
bi-exponential problem would consist of one observation from each of the
control and treatment groups. Denoting this MTS by simply $(t_1, t_2)$, the
corresponding intrinsic priors are easily seen to be
$\pi_0^I(\theta) = \pi_0^N(\theta) = \theta^{-1}$ and

\[ \pi_1^I(\theta_1, \theta_2) = \int_0^\infty \!\! \int_0^\infty \frac{t_1 t_2}{(t_1 + t_2)^2}
   \exp(-t_1\theta_1) \exp(-t_2\theta_2)\, dt_1\, dt_2. \]

Combining these intrinsic priors with the likelihood (13) and interchanging
order of integration results in the Bayes factor

\[
B_{10}^{I} = \frac{\Gamma(n_{1u} + 1)\,\Gamma(n_{2u} + 1)}{\Gamma(n_{1u} + n_{2u})}\,
 (T_1 + T_2)^{n_{1u} + n_{2u}}
 \times \int_0^\infty \!\! \int_0^\infty
 \frac{t_1 t_2}{(t_1 + t_2)^2\, (T_1 + t_1)^{n_{1u}+1}\, (T_2 + t_2)^{n_{2u}+1}}\, dt_1\, dt_2.
\tag{16}
\]

For the data of Example 11 numerical computation yields $B_{10}^{I} = 503$, a
value quite close to those obtained with the approximate arithmetic IBF and
using SMTS training samples.

Another advantage of having (approximate) intrinsic priors, as above, is 
that they can be utilized to develop conditional frequentist tests. Indeed, the 
intrinsic prior above has been utilized in Paulo (2002) to develop optimal 
conditional frequentist tests for the bi-exponential testing problem. 
4. Discrete examples. Difficulties with training sample approaches for
discrete data have been highlighted in several papers [e.g., Bertolino and 
Racugno (1996), O'Hagan (1997) and Berger and Pericchi (1998)]. We first 
revisit one of the more vexing examples, to see if randomized training sam- 
ples fix the problem. 

Example 13 (Bernoulli testing). Based on $n$ Bernoulli trials, with
$P(X_i = 1 \mid \theta) = \theta = 1 - P(X_i = 0 \mid \theta)$, it is desired to test

\[ M_0 : \theta = \theta_0 \quad \text{versus} \quad M_1 : \theta \neq \theta_0. \]

Suppose the improper Haldane prior
$\pi^N(\theta) = \theta^{-1}(1 - \theta)^{-1}$ is utilized to
construct an IBF. This is a quite inferior noninformative prior, but it is
interesting to see if IBFs can be made robust to poor choices of the initial
noninformative prior. Note that for the Haldane prior

\[ B_{10}^N = \frac{\Gamma(S)\,\Gamma(n - S)}{\Gamma(n)\,\theta_0^{S}\,(1 - \theta_0)^{n - S}}, \]

where $S$ is the number of ones in the data.

With the Haldane prior an MTS must consist of precisely one 1 and one 0.
(One and only one of each is needed for the resulting posterior to be proper.)
Since $P_\theta^{M_1}(\mathcal{X}) = 2\theta(1 - \theta) < 1$, Assumption 0 is
clearly violated and the resulting IBF is again suspect. Indeed, noting that
$B_{01}^N(\{0, 1\}) = \theta_0(1 - \theta_0)$, it is immediate from (5) that the
implied intrinsic prior is

\[ \pi^I(\theta) = \frac{\theta_0 (1 - \theta_0)}{\theta (1 - \theta)}. \tag{17} \]

This is itself improper (indeed it is simply a constant multiple of the
original Haldane prior) and strongly suggests that the IBF for the Haldane
noninformative prior and the usual definition of a minimal training sample do
not correspond to a sensible Bayes procedure.

An extreme case of this example arises when $\theta_0 = 0$ and the data consist
of one 1 and the rest 0. O'Hagan (1997) noted that then $M_0 : \theta = 0$ is
wrong with certainty (one cannot observe a 1 under $M_0$), yet the intrinsic
Bayes factor will then equal $1/(n - 1)$, for $n > 2$. The basic problem, in
this case, is that $P_{\theta_0}^{M_0}(\mathcal{X}) = 0$, an extreme violation of
Assumption 0. A single extra 1 ($S = 2$) would solve the problem, making
$B_{10} = \infty$ (as it should be), but the behavior of the IBF is indeed
disturbing when $S = 1$.

This extreme example is a good test of the effectiveness of SMTS. An
SMTS will either be of the form $x^*(l) = (0, 0, \ldots, 0, 1)$ or
$x^*(l) = (1, 1, \ldots, 1, 0)$; these can obviously be summarized by
specifying $N_0$ (the number of zeroes) and $N_1$ (the number of ones),
respectively. Noting that $P_\theta(N_0) = (1 - \theta)^{N_0}\theta$
and $P_\theta(N_1) = \theta^{N_1}(1 - \theta)$, for $N_0, N_1 = 1, 2, \ldots$,
it follows that

\[ B_{01}^N(N_0) = \frac{\theta_0 (1 - \theta_0)^{N_0}}{\int_0^1 \theta (1 - \theta)^{N_0}\, \pi^N(\theta)\, d\theta}
   = N_0\, \theta_0 (1 - \theta_0)^{N_0}, \qquad
   B_{01}^N(N_1) = N_1 (1 - \theta_0)\, \theta_0^{N_1}. \]
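To determine the intrinsic prior corresponding to the arithmetic IBF in
(5), we first choose $\mathcal{X}$ to be the set of training samples, $N_0$ and
$N_1$, arising from an infinite series of Bernoulli($\theta$) trials. When
$0 < \theta_0 < 1$ it is clear that $P_\theta^{M_1}(\mathcal{X}) = 1$, so that
the intrinsic prior is

\[
\begin{aligned}
\pi^I(\theta) &= \pi^N(\theta)\, E_\theta^{M_1}\!\left[ B_{01}^N(X^*(l)) \right] \\
 &= \frac{1}{\theta(1 - \theta)} \left[ \theta_0\,\theta \sum_{j=1}^{\infty} j\,[(1 - \theta)(1 - \theta_0)]^j
    + (1 - \theta_0)(1 - \theta) \sum_{j=1}^{\infty} j\,[\theta\theta_0]^j \right] \\
 &= \theta_0 (1 - \theta_0) \left[ \frac{1}{(1 - (1 - \theta)(1 - \theta_0))^2} + \frac{1}{(1 - \theta\theta_0)^2} \right].
\end{aligned}
\tag{18}
\]

It can be verified that $\int_0^1 \pi^I(\theta)\, d\theta = 1$, so the intrinsic
prior is proper. (With slightly less work, this also follows from Lemma 1.)
Also, the intrinsic prior is admirably balanced, in the sense that the median
is very close to $\theta_0$. [Numerical computation shows that
$0.48 < P(\theta < \theta_0) < 0.52$ for all $\theta_0$.] Thus all indications
are that the use of the SMTS has corrected the problem caused by the bad
initial noninformative prior.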
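
The propriety and balance of this intrinsic prior are easy to probe numerically. The sketch below is an illustration only: the function names and the choice of a composite Simpson rule are assumptions made here, and the density coded is $\theta_0(1-\theta_0)\,[(1-(1-\theta)(1-\theta_0))^{-2} + (1-\theta\theta_0)^{-2}]$ on $(0, 1)$. It integrates the density over $(0, 1)$ and over $(0, \theta_0)$; the second quantity should stay within a few hundredths of 1/2.

```python
import math


def intrinsic_prior(theta, theta0):
    """Intrinsic prior density for the SMTS-based arithmetic IBF (Bernoulli case)."""
    a = 1.0 - (1.0 - theta) * (1.0 - theta0)
    b = 1.0 - theta * theta0
    return theta0 * (1.0 - theta0) * (a ** -2 + b ** -2)


def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0


def mass_below(theta0, upper):
    """Prior mass on (0, upper) for the given null value theta0."""
    return simpson(lambda t: intrinsic_prior(t, theta0), 0.0, upper)


if __name__ == "__main__":
    for theta0 in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(theta0, mass_below(theta0, 1.0), mass_below(theta0, theta0))
```

Values of `theta0` very close to 0 or 1 make the density sharply peaked near an endpoint, so a finer grid (larger `n`) would be needed there.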

Of course, we needed the condition $0 < \theta_0 < 1$ for the SMTS to work. For
the extreme $\theta_0 = 0$ (or the case $\theta_0 = 1$), Assumption 0 remains
violated even for the SMTS; indeed, $P_{\theta_0}^{M_0}(\mathcal{X}) = 0$ in the
extreme cases, so that no set of proper training samples can work. As an
indication of the danger in using training sample approaches when Assumption 0
is violated, consider again the situation considered by O'Hagan (1997). The
arithmetic IBF, based on use of SMTS for the given data, can be computed to be
$B_{10} = (n^2 - n + 2)/[2n(n - 1)]$, which, while an improvement over
$1/(n - 1)$, is still not $\infty$, as it should be. Hence even use of SMTS
cannot correct the situation when Assumption 0 is violated.
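
The value $(n^2 - n + 2)/[2n(n - 1)]$ can be checked by direct enumeration. The sketch below rests on a reading of the setup that is stated here as an assumption: SMTS are formed by drawing from the data without replacement, so the single 1 is equally likely to sit in any of the $n$ positions of the random order. As $\theta_0 \to 0$, the full-data factor $1/[(n-1)\theta_0(1-\theta_0)^{n-1}]$ times the training-sample correction $N_0\theta_0(1-\theta_0)^{N_0}$ (or $N_1(1-\theta_0)\theta_0^{N_1}$ with $N_1 = 1$) tends to $N/(n-1)$, where $N$ is the count of leading zeroes (or the single leading one). The function names are hypothetical.

```python
from fractions import Fraction


def smts_ibf_limit(n):
    """Limit as theta0 -> 0 of the SMTS arithmetic IBF when the data hold one 1
    and n-1 zeros, averaging over the n equally likely positions of the 1."""
    if n < 2:
        raise ValueError("need n >= 2")
    total = Fraction(0)
    for pos in range(1, n + 1):
        # pos == 1: the SMTS is (1, 0), i.e. N1 = 1; otherwise it is
        # (0, ..., 0, 1) with N0 = pos - 1 zeros.
        count = 1 if pos == 1 else pos - 1
        # Full-data factor times training-sample correction -> count / (n - 1).
        total += Fraction(count, n - 1)
    return total / n


def closed_form(n):
    """The closed-form value (n^2 - n + 2) / (2 n (n - 1))."""
    return Fraction(n * n - n + 2, 2 * n * (n - 1))


if __name__ == "__main__":
    for n in (2, 5, 10, 100):
        print(n, smts_ibf_limit(n), closed_form(n))
```

Exact rational arithmetic is used so that the enumeration and the closed form can be compared without rounding.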

One might wonder if the training sample solution fails as, say,
$\theta_0 \to 0$. This is awkward to discuss in terms of the arithmetic IBF
itself, since the sample size would correspondingly need to grow to $\infty$
before a proper training sample could be obtained. We thus look at direct use
of the intrinsic prior (18) to see if it yields a satisfactory Bayes factor.
Indeed, the resulting Bayes factor is

\[ B_{10}^{I} = \frac{\int_0^1 \theta^{S} (1 - \theta)^{n - S}
   \left[ (1 - (1 - \theta)(1 - \theta_0))^{-2} + (1 - \theta\theta_0)^{-2} \right] d\theta}
   {\theta_0^{S - 1} (1 - \theta_0)^{n - S - 1}}. \]

For the problematical case $S = 1$, $n > 2$ and $\theta_0 \to 0$,

\[ B_{10}^{I} \to \int_0^1 \theta (1 - \theta)^{n - 1} \left[ \theta^{-2} + 1 \right] d\theta
   \ \ge\ \int_0^1 \theta^{-1} (1 - \theta)^{n - 1}\, d\theta, \]

which is infinite. Thus, for very small $\theta_0$ and the observation $S = 1$,
one would properly conclude that the alternative $M_1$ is true.



Next we revisit the Poisson example from Section 2.2, to see the effective- 
ness of imaginary training samples when only a sufficient statistic is given. 

Example 14 (Example 8 continued). Recall we are testing
$H_0 : \theta = \theta_0$ versus $H_1 : \theta \neq \theta_0$. For the Jeffreys
prior $1/\sqrt{\theta}$ under $H_1$ computation yields
that the formal Bayes factor is

N_ r(x + i/2) 

Recall that we generate imaginary data $x^*$ and assume it to be exponential
with mean $1/\theta$. A single such observation is a minimal training sample.
The arithmetic IBF in (10) is thus given by

To study the performance of this objective Bayes factor we again deter- 
mine the corresponding intrinsic prior. Since the x* were actually generated 
from (7), the intrinsic prior in (5) is given by 

[The intrinsic prior, as defined in Berger and Pericchi (1996a), is based
on letting the sample size go to infinity; for the Poisson problem the
analogue of this definition is $T \to \infty$.] Since the integrand in (18) is
bounded above by
$\theta_0 (x_l^*)^{3/2} \exp(-\theta_0 x_l^*)/\Gamma(3/2)$, which is integrable,
we may invoke the dominated convergence theorem to take the limit inside the
integral. Furthermore,

^lim -(^1-^j =9eM-0xn 
almost surely, so that 

\[ \pi^I(\theta) = \frac{1}{\sqrt{\theta}} \int_0^\infty \frac{\theta_0}{\Gamma(3/2)}\,
   (x_l^*)^{3/2} \exp(-\theta_0 x_l^*)\, \theta \exp(-\theta x_l^*)\, dx_l^*
   = \frac{3\,\theta_0 \sqrt{\theta}}{2\,(\theta + \theta_0)^{5/2}}. \]

This is a proper prior, and has median approximately equal to $(1.7)\theta_0$,
a quite satisfactory prior. Hence the arithmetic IBF based on imaginary
training samples arising from a single Poisson observation seems fine.
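
The median near $1.7\,\theta_0$ can be recovered in closed form. The sketch below is an illustrative check, not taken from the source: it uses the density $3\theta_0\sqrt{\theta}/[2(\theta+\theta_0)^{5/2}]$ and the observation (made here) that with $v = \theta/(\theta+\theta_0)$ the mass below $\theta$ is $v^{3/2}$. The function names are hypothetical.

```python
import math


def intrinsic_prior(theta, theta0):
    """Density 3*theta0*sqrt(theta) / (2*(theta + theta0)**2.5) on theta > 0."""
    return 1.5 * theta0 * math.sqrt(theta) / (theta + theta0) ** 2.5


def cdf(theta, theta0):
    """Mass below theta: v**1.5 with v = theta / (theta + theta0)."""
    return (theta / (theta + theta0)) ** 1.5


def median(theta0):
    """Solve v**1.5 = 1/2 for v, then invert v = theta / (theta + theta0)."""
    v = 0.5 ** (2.0 / 3.0)
    return theta0 * v / (1.0 - v)


if __name__ == "__main__":
    for theta0 in (0.5, 1.0, 3.0):
        m = median(theta0)
        print(theta0, m, m / theta0, cdf(m, theta0))
```

Since the CDF tends to 1 as $\theta \to \infty$, the same formula also confirms that the prior is proper, and the ratio `median(theta0) / theta0` does not depend on `theta0`.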

5. Information-based training samples in the linear model. As men- 
tioned in Section 2.1, it is attractive to consider choosing training samples 
according to their information content. We begin with a classic example 
demonstrating the need to do this. [A related example can be found in 
Iwaki (1997).] 

Example 15 (Findley's example). Findley (1991) demonstrated the inadequacy
of BIC in the following situation. Suppose we observe
$X_i = d_i\theta + \varepsilon_i$, for $i = 1, \ldots, n$, and that the
$\varepsilon_i$ are i.i.d. $N(0, 1)$. It is desired to test

\[ M_0 : \theta = 0 \quad \text{versus} \quad M_1 : \theta \neq 0. \]

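The standard noninformative prior is $\pi^N(\theta) = 1$, and the corresponding
formal Bayes factor is

\[ B_{10}^N = \frac{\sqrt{2\pi}}{\|d\|} \exp\!\left( \frac{(\sum_{i=1}^n d_i x_i)^2}{2\|d\|^2} \right), \]

where

\[ \|d\|^2 = \sum_{i=1}^n d_i^2. \]

A minimal training sample is a single observation $x_i$, and

\[ B_{01}^N(x_i) = \frac{|d_i|}{\sqrt{2\pi}} \exp\!\left( -\frac{x_i^2}{2} \right). \]

It follows that the arithmetic IBF is

\[ B_{10}^{AI} = \frac{\sqrt{2\pi}}{\|d\|} \exp\!\left( \frac{(\sum_{i=1}^n d_i x_i)^2}{2\|d\|^2} \right)
   \frac{1}{n} \sum_{i=1}^n \frac{|d_i|}{\sqrt{2\pi}} \exp\!\left( -\frac{x_i^2}{2} \right). \]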

The interesting special case considered by Findley was $d_i = i^{-1/2}$. Then
as $n \to \infty$ it is straightforward to show that $\|d\|^2 = O(\log n)$,

\[ \frac{(\sum_{i=1}^n d_i x_i)^2}{\|d\|^2} = \theta^2 \log n + 2Z\theta\sqrt{\log n} + O(1) \]

and

\[ \frac{1}{n} \sum_{i=1}^n \frac{|d_i|}{\sqrt{2\pi}} \exp\!\left( -\frac{x_i^2}{2} \right) = O(n^{-1/2}), \]

where $Z$ is a standard normal random variable. It follows that

\[ B_{10}^{AI} = O([n \log n]^{-1/2}) \exp\!\left( \tfrac{1}{2}\theta^2 \log n + Z\theta\sqrt{\log n} \right). \]

Under $M_0 : \theta = 0$, it is clear that
$B_{10}^{AI} = O([n \log n]^{-1/2}) \to 0$ as $n \to \infty$; this is fine, as
it indicates that $M_0$ is true. But if $M_1$ is true with $\theta^2 < 1$, then
$B_{10}^{AI} \to 0$ also, which means the arithmetic IBF is then inconsistent, a
severe inadequacy. (If $\theta^2 = 1$, the arithmetic IBF is consistent or not,
depending on the sign of $Z$, that is, it will be consistent half the time.)
The source of the problem (as with the associated inconsistency of BIC, as
shown by Findley) is that the observations $X_i$ contain drastically decreasing
information, $d_i^2 = i^{-1}$, as $i$ increases.

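This thus provides a good test for the idea of weighting the training
samples by the amount of information they contain, that is, setting
$p_i = d_i^2/\|d\|^2$ in Definition 1, and using the corresponding weighted IBF
in (8). The resulting Bayes factor, using similar arguments to above, satisfies

\[
\begin{aligned}
B_{10}^{W} &= \frac{\sqrt{2\pi}}{\|d\|} \exp\!\left( \frac{(\sum_{i=1}^n d_i x_i)^2}{2\|d\|^2} \right)
   \sum_{i=1}^n \frac{d_i^2}{\|d\|^2}\, \frac{|d_i|}{\sqrt{2\pi}} \exp\!\left( -\frac{x_i^2}{2} \right) \\
 &= O\!\left( (\log n)^{-3/2}\, n^{\theta^2/2} \exp\!\left( Z\theta\sqrt{\log n} \right) \right).
\end{aligned}
\]

This still goes to 0 under $M_0$ (as it should), but now goes to $\infty$ under
$M_1$ (as it should).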

The use of weighted training samples solved the inconsistency problem, 
but that is a very crude criterion and the goal in use of training samples is to 
achieve actual Bayesian behavior. Unfortunately, even the use of weighted 
training samples fails this goal in this challenging situation. For instance, 
the weighted expected posterior prior in (9) for this situation is 

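\[ \pi^{EP}(\theta) = \sum_{i=1}^n \frac{d_i^2}{\|d\|^2}\, \frac{|d_i|}{\sqrt{2\pi}}
   \exp\!\left( -\frac{d_i^2}{2} \left( \theta - \frac{x_i}{d_i} \right)^2 \right). \tag{19} \]

Although this is, of course, proper, its variance can be shown to be
$O(n/\log n)$, so that it becomes increasingly diffuse as $n \to \infty$. Thus
the limit is not a stable prior distribution, as one would want.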
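
The growth of this variance can be seen from a short calculation. The sketch below treats the weighted expected posterior prior as a mixture of normal components with means $x_i/d_i$, variances $1/d_i^2$ and weights $p_i = d_i^2/\|d\|^2$ (the posterior of $\theta$ given the single observation $x_i$ under the flat prior). The within-component part of the mixture variance is $\sum_i p_i/d_i^2 = n/\|d\|^2$, which for $d_i = i^{-1/2}$ is $n$ divided by the $n$-th harmonic number and so grows like $n/\log n$. The function names, the seed and the value of `theta` used for simulation are illustrative assumptions.

```python
import math
import random


def findley_covariates(n):
    """Covariates d_i = i**(-1/2) for i = 1..n (Findley's special case)."""
    return [i ** -0.5 for i in range(1, n + 1)]


def mixture_variance(x, d):
    """Variance of the mixture sum_i p_i N(x_i/d_i, 1/d_i^2), p_i = d_i^2/||d||^2.
    Returns (total variance, within-component part)."""
    norm2 = sum(di * di for di in d)
    weights = [di * di / norm2 for di in d]
    means = [xi / di for xi, di in zip(x, d)]
    within = sum(w / (di * di) for w, di in zip(weights, d))
    mean = sum(w * m for w, m in zip(weights, means))
    between = sum(w * (m - mean) ** 2 for w, m in zip(weights, means))
    return within + between, within


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    theta = 0.5             # hypothetical true value under M1
    for n in (10, 100, 1000, 10000):
        d = findley_covariates(n)
        x = [di * theta + rng.gauss(0.0, 1.0) for di in d]
        total, within = mixture_variance(x, d)
        print(n, round(total, 2), round(within, 2), round(n / math.log(n), 2))
```

The within-component term is a lower bound on the total variance, so the mixture cannot settle to a stable distribution as `n` grows.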

The problem here is that the training samples corresponding to larger i 
simply have too little information for them to be useful as training sam- 
ples. This situation was also encountered in Rodriguez and Pericchi (2001) 
in dynamic linear models. Their reasonable solution was to only use the 
most informative training samples to develop intrinsic Bayes factors or
expected posterior priors. For instance, a simple modification of (19) would
be to truncate the summation at some moderate value $n_0$ (replacing $\|d\|^2$
by the truncated sum), effectively assigning a "weight" of zero to the
low-information training samples. This is an effective option in such situations.




For the general linear model the above phenomenon can also be observed. 
For clarity we switch to a more standard notation for model selection in the 
linear model. Suppose for $j = 1, \ldots, q$ that model $M_j$ for the data $Y$
($n \times 1$) is the linear model

\[ M_j : Y = X_j \beta_j + \varepsilon_j, \qquad \varepsilon_j \sim N_n(0, \sigma_j^2 I_n), \]

where $\sigma_j^2$ and $\beta_j = (\beta_{j1}, \beta_{j2}, \ldots, \beta_{jk_j})'$
are unknown and $X_j$ is an ($n \times k_j$) given design matrix of rank
$k_j < n$. Let $R_j = |(I - X_j(X_j'X_j)^{-1}X_j')Y|^2$ denote the residual sum
of squares for $M_j$.

As usual, we utilize the reference prior
$\pi_j^N(\beta_j, \sigma_j) = \sigma_j^{-1}$ as the initial noninformative
prior. A minimal training sample $y(l)$, with corresponding design matrix
$X_j(l)$ under $M_j$, is a sample of size $\max\{k_j\} + 1$ such that all
$(X_j'(l)X_j(l))$ are nonsingular; let $L$ denote the number of such training
samples. If $k_j > k_i$,

\[ C = \frac{\Gamma((n - k_j)/2)\,\Gamma((k_j - k_i + 1)/2)}{\Gamma((n - k_i)/2)\,\Gamma(1/2)} \]

and

\[ R_j(l) = |(I - X_j(l)(X_j'(l)X_j(l))^{-1}X_j'(l))\, y(l)|^2, \]

it is shown in Berger and Pericchi (1996b) that

\[
B_{ji}^{AI} = \frac{|X_i'X_i|^{1/2}}{|X_j'X_j|^{1/2}}\, \frac{R_i^{(n - k_i)/2}}{R_j^{(n - k_j)/2}}\,
 \frac{C}{L} \sum_{l=1}^{L} \frac{|X_j'(l)X_j(l)|^{1/2}}{|X_i'(l)X_i(l)|^{1/2}}\,
 \frac{(R_j(l))^{1/2}}{(R_i(l))^{(k_j - k_i + 1)/2}}.
\tag{20}
\]

Problems can again arise here if too many of the $|X_j'(l)X_j(l)|$ (which are
proportional to the "information" in the training samples) are small.

Example 16. Consider the special case of testing whether the slope of
a linear regression is zero. Thus, let $M_1$ be the model with only the
constant term $\beta_1$ and $X_1' = (1, \ldots, 1)$, and $M_2$ be the model with
$(\beta_1, \beta_2)$ and

\[ X_2' = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 & 1 \\
                          0 & \cdots & 0 & \delta & \cdots & \delta & 1 \end{pmatrix}, \]

with $m = (n - 1)/2$ being the number of zeroes and also the number of
$\delta$'s. Let $\delta$ be very close to zero. Minimal training samples are
then of two types. The high-information minimal training samples are triples
$\{y_i, y_j, y_n\}$, where $i \neq j$ range from 1 to $n - 1$. There are
$m(2m - 1)$ such training samples, and they have $|X_2'(l)X_2(l)| \approx 2$.
The low-information minimal training samples include either one observation
from the first $m$ and two observations from the second $m$, or the reverse.
There are $m^2(m - 1)$ such training samples and they have
$|X_2'(l)X_2(l)| = 2\delta^2$. Since $\delta$ is very small, the
low-information training samples contribute essentially zero to the expression
in (20), so that (with the high-information training samples labelled as
$l = 1, \ldots, m(2m - 1)$),

\[ B_{21}^{AI} \approx \frac{|X_1'X_1|^{1/2}}{|X_2'X_2|^{1/2}}\, \frac{R_1^{(n - 1)/2}}{R_2^{(n - 2)/2}}\,
   \frac{C}{m(m^2 + m - 1)} \sum_{l=1}^{m(2m - 1)} \frac{\sqrt{2}}{\sqrt{3}}\, \frac{(R_2(l))^{1/2}}{R_1(l)}. \]

As $m$ grows the term involving the training samples clearly goes to zero
(since the residual sums of squares for the training samples can be shown
to go to nonzero constants as $\delta \to 0$), an undesirable result. Giving
equal weight to the (many more) low-information training samples has
effectively washed out the effect of the high-information training samples.
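
The counts and determinants in this example can be confirmed by brute-force enumeration for small $m$. The sketch below is illustrative, with hypothetical function names. It uses the fact that for design rows $(1, x)$ the determinant of $X'X$ over a set of rows is $s\sum x^2 - (\sum x)^2$, where $s$ is the number of rows, and it classifies a triple as high-information when it contains the last observation.

```python
from itertools import combinations


def gram_det(xs):
    """det of X'X for rows (1, x), x in xs: s*sum(x^2) - (sum x)^2."""
    s = len(xs)
    return s * sum(x * x for x in xs) - sum(xs) ** 2


def classify_training_samples(m, delta):
    """Enumerate all triples from m zeros, m deltas and a final 1 (n = 2m + 1).
    Return determinants of the high-information triples (those containing the
    last observation) and of the low-information triples (nonsingular triples
    that do not contain it)."""
    x = [0] * m + [delta] * m + [1]
    last = len(x) - 1
    high, low = [], []
    for idx in combinations(range(len(x)), 3):
        det = gram_det([x[i] for i in idx])
        if last in idx:
            high.append(det)
        elif len({x[i] for i in idx}) > 1:  # covariates not all equal
            low.append(det)
    return high, low


if __name__ == "__main__":
    m, delta = 5, 0.01
    high, low = classify_training_samples(m, delta)
    print(len(high), m * (2 * m - 1))   # number of high-information triples
    print(len(low), m * m * (m - 1))    # number of low-information triples
    print(min(high), max(high), max(low))
```

For small `delta` the high-information determinants sit just below 2 while the low-information ones equal `2 * delta**2`, and the share of high-information triples, `m*(2*m - 1) / (m*(m*m + m - 1))`, shrinks roughly like `2/m`.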

The natural solution to this difficulty in the linear model is to weight
the training samples according to their information content, that is, choose
$p(l) \propto |X_j'(l)X_j(l)|$. The problem discussed above will then disappear.
Indeed, since there are plenty of high-information training samples available
(if $m$ is large), the weighted IBF will have a (nice) intrinsic prior. (This is
in contrast to Example 15, where there were not enough high-information
training samples to achieve this.) So here weighting works ideally.

The Binet-Cauchy theorem yields the interesting result that

\[ \sum_{l=1}^{L} |X_j'(l)X_j(l)| = (n - k_j)\, |X_j'X_j| \]

(i.e., we know the normalization constant for the information-based weighting
probabilities), and the weighted IBF then becomes

\[
B_{ji}^{W} = \frac{|X_i'X_i|^{1/2}}{|X_j'X_j|^{3/2}}\, \frac{R_i^{(n - k_i)/2}}{R_j^{(n - k_j)/2}}
 \sum_{l=1}^{L} \frac{C\, |X_j'(l)X_j(l)|^{3/2}}{(n - k_j)\, |X_i'(l)X_i(l)|^{1/2}}\,
 \frac{(R_j(l))^{1/2}}{(R_i(l))^{(k_j - k_i + 1)/2}}.
\tag{21}
\]

We do not yet have much experience with use of this IBF, but our current 
understanding suggests that this will often be better than the usual arith- 
metic IBF with MTS in linear models. The use of an approximation to a 
similarly weighted geometric version of the IBF was suggested in de Vos 
(1993). 
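
The normalization identity behind the information-based weights, namely that the determinants $|X_j'(l)X_j(l)|$ summed over all training samples of size $k_j + 1$ equal $(n - k_j)\,|X_j'X_j|$, can be checked exactly for a simple regression design ($k_j = 2$, rows of the form $(1, x)$). The sketch below is illustrative: the function names and the example covariates are assumptions made here, and singular triples are included in the sum since they contribute zero.

```python
from fractions import Fraction
from itertools import combinations


def gram_det_2col(xs):
    """Exact det of X'X when each row of X is (1, x) for x in xs."""
    s = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    return s * sxx - sx * sx


def weight_normalizer(xs):
    """Sum of |X(l)'X(l)| over all training samples of size k + 1 = 3."""
    return sum(gram_det_2col(list(sub)) for sub in combinations(xs, 3))


def information_weights(xs):
    """Weights p(l) proportional to |X(l)'X(l)|, keyed by the index triple."""
    total = weight_normalizer(xs)
    return {idx: gram_det_2col([xs[i] for i in idx]) / total
            for idx in combinations(range(len(xs)), 3)}


if __name__ == "__main__":
    # Covariates in the style of the slope-testing example, with delta = 1/10.
    xs = [Fraction(0)] * 3 + [Fraction(1, 10)] * 3 + [Fraction(1)]
    n, k = len(xs), 2
    print(weight_normalizer(xs), (n - k) * gram_det_2col(xs))
    print(sum(information_weights(xs).values()))
```

Knowing the normalizer in closed form means the weights can be computed for any single training sample without enumerating all of them.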

Finally, the same issue can be shown to arise with the expected posterior 
prior in the linear model, so that utilization of the weighted version 

\[
\pi_j^{EP}(\beta_j, \sigma_j) = \sum_{l=1}^{L} \frac{|X_j'(l)X_j(l)|}{(n - k_j)\, |X_j'X_j|}\,
   \pi_j^N(\beta_j, \sigma_j \mid y(l))
\tag{22}
\]

should be considered. 



While the purpose of this paper is not comparison of objective model 
selection procedures, it is worthwhile to pause and note that the examples 
we have been considering are challenging for essentially any procedure. As
an illustration, consider the most common objective prior used for Bayesian
model selection with linear models, the $g$-prior, given by
$\pi_i(\sigma_i^2) = 1/\sigma_i^2$ and

\[ \pi_i(\beta_i \mid \sigma_i^2) \ \text{is} \ N_{k_i}\!\left( 0,\ g\,\sigma_i^2 (X_i'X_i)^{-1} \right). \]

These were proposed in Zellner (1986) for estimation problems. The typical
choice of $g$ is $g = n$. Zellner and Siow (1980) suggested a more appropriate
(for testing) multivariate Cauchy form for the prior, but it shares with the
$g$-prior the underlying scale matrix $\Sigma = n\sigma_i^2 (X_i'X_i)^{-1}$,
which turns out to be quite problematical if it is highly unbalanced.

Example 17 (Example 16 continued). Noting that the sample size here
is $n$, computation shows that

\[ n\sigma_2^2 (X_2'X_2)^{-1} \approx \sigma_2^2 \begin{pmatrix} 1 & -1 \\ -1 & n \end{pmatrix}, \]


so that the information available about $\beta_1$ is vastly different from the
information available about $\beta_2$. Indeed, using the $g$-priors with
$g = n$ for both $M_1$ and $M_2$ results in the Bayes factor

\[ B_{10} = \frac{1}{\sqrt{n + 1}}\,
   \frac{\left( y'y - \frac{n}{n + 1}\, y'X_2(X_2'X_2)^{-1}X_2'y \right)^{-n/2}}
        {\left( y'y - \frac{n}{n + 1}\, y'X_1(X_1'X_1)^{-1}X_1'y \right)^{-n/2}}. \]
For large n and very small 6 [namely, 6 = o(n~^)], computation shows that 

Bio = ^ exp ' 



n V 2S'^/n 

where $\bar{y}$ and $S^2$ are the usual sample mean and sum of squared
deviations. Since the exponential term is bounded in $n$, it follows that
$B_{10} \to 0$ as $n$ grows. Hence this Bayes factor is inconsistent under
$M_1$, a particularly troubling result.

The difficulty here is that, in a sense, one would like to choose $g = n$ for
the information component due to $\beta_1$, but $g = 1$ for the component due to
$\beta_2$. The arithmetic IBF and empirical posterior prior (either the weighted
or unweighted versions) do this type of adjustment automatically. [It should
be mentioned that this would also cause a difficulty with fractional Bayes 
factors, unless differing fractions are allowed; see De Santis and Spezzaferri 
(1998a, 1999) and Berger and Pericchi (2001) for discussion.] 

In Example 5 it was noted that a problem can also arise with too in- 
formative training samples, and that it can be wise to restrict attention to 
training samples whose information content remains modest compared to 
the information in the entire sample. 




Example 18 (Example 15 continued). Consider the regression example, 
but with covariates di = i. Then the information is rapidly growing with i. 
The expected posterior prior in (19) can then be shown to have variance 
that is 0(n~^), so that the prior becomes increasingly (and arguably inap- 
propriately) concentrated as n — > oo. [The same is true if equal weighting 
is used for the training samples; hence the use of weights in (19) neither 
helped nor hurt.] Here, simply using only the first, say, no training samples 
(i.e., those with a modest amount of information) would avoid the problem. 

It is interesting to note that the common $g$-prior in this situation has
variance $n(\sum_i d_i^2)^{-1} = O(n^{-2})$, which inappropriately concentrates
much faster than does the expected posterior prior in (19).

6. Conclusions. It is notoriously difficult to develop model selection method- 
ologies that are successful over a wide range of problems. In judging success, 
our "goal" of developing objective procedures that behave like some reason- 
able Bayesian procedures may seem to be a rather modest criterion, but it 
is far stronger than any other criterion we know. We also feel that "testing" 
a procedure on extreme examples is by far the best method of judging the 
limits of the procedure, and in suggesting needed refinements. As we have 
tested intrinsic Bayes factors and expected posterior priors in the years 
since their development, it has become increasingly clear that the original 
suggestion — to always use minimal training samples — was too limited. This 
paper presented a summary of the highlights of these investigations and 
our suggestions for the needed refinements. The two major conclusions that 
emerged are: 

• In situations, such as censoring, in which certain observations would never 
be part of an MTS, instead utilize SMTS, which will allow possible in- 
volvement of all observations. 

• In situations, such as the linear model, in which MTS can contain drasti- 
cally different information content, consider weighting the training sam- 
ples (or randomly choosing them) according to their information content. 

Random training samples are also useful in other situations, such as when 
only sufficient statistics, not the actual data, are available. And there are 
further interesting possibilities that we have not explored, such as forming 
random training samples by sampling from the data until one has obtained 
a training sample with at least some pre-specified information content. 

Attention in this paper was primarily confined to the arithmetic IBF and 
the expected posterior prior. However, the generalizations of training sam- 
ples can (and should) also be used with other training-sample approaches. 
For instance, the geometric IBF can use the generalizations in exactly the 
same way as the arithmetic IBF. The median IBF is often preferable to the 
arithmetic or geometric IBFs from a robustness perspective [Berger and
Pericchi (1998)], and randomized training samples can again be utilized directly
in its computation. It is not immediately obvious how to utilize weighted 
training samples with the median IBF, however. The easiest approach is to 
draw random training samples with probabilities proportional to the weights 
and then use the median IBF with these training samples. 

Finally, it should be noted that this was not meant to be a survey paper, 
and so we have not dealt with all issues involved in suitably defining training 
samples. For instance, in Sivaganesan and Lingham (1999) it is shown how 
transformations of the data are sometimes needed to obtain suitable training 
samples. 

APPENDIX 

Lemma 2. In the situation of Example 10, use of the arithmetic IBF 
based on the Jeffreys-rule prior results in an intrinsic prior with median
$O(r^{-1})$.

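Proof. Since any single observation, censored or uncensored, is an MTS
for the Jeffreys-rule prior, (5) leads to the following intrinsic prior:

\[
\pi^I(\theta) = \pi^J(\theta) \left[ \int_0^r
   \frac{\theta_0 \exp(-\theta_0 x)}{\int \pi^J(\theta)\,\theta \exp(-\theta x)\, d\theta}\,
   \theta \exp(-\theta x)\, dx
   + \frac{\exp(-\theta_0 r)}{\int \pi^J(\theta) \exp(-\theta r)\, d\theta}\, \exp(-\theta r) \right].
\tag{23}
\]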

To study the behavior of the median of this intrinsic prior as $r \to 0$, note
that the mass of the first term on the right-hand side of (23) is, switching
order of integration,

\[ \int_0^r \theta_0 \exp(-\theta_0 x)\, dx = 1 - e^{-\theta_0 r} \to 0 \quad \text{as } r \to 0. \]

Hence the median as $r \to 0$ depends only on the second term on the right-hand
side of (23). Computation shows that
$\int \pi^J(\theta) \exp(-\theta r)\, d\theta = 1.5814$ (not depending on $r$).
Also, $\exp(-\theta_0 r) \to 1$ as $r \to 0$, so that the median, med($r$), as
$r \to 0$ is approximately given by the solution to

\[ \frac{1}{2} = \int_0^{\mathrm{med}(r)} \frac{1}{1.5814}\,
   \theta^{-1} [1 - \exp(-r\theta)]^{1/2} \exp(-r\theta)\, d\theta. \]

The change of variables $y = r\theta$ results in the equation

\[ 0.7907 = \int_0^{r\,\mathrm{med}(r)} y^{-1} (1 - \exp(-y))^{1/2} \exp(-y)\, dy. \]

Solving this equation for $r\,\mathrm{med}(r)$ results in the conclusion that
med($r$) $= 0.191/r$, completing the proof. □